We will look at the Ames Housing data. The task is to predict sale prices for houses sold in 2008 and later, based on data from before 2008.
library(AmesHousing)
data(ames_raw,package = "AmesHousing")
ames_raw_2008=ames_raw[ames_raw$`Yr Sold`<2008,]
ames_raw_2009=ames_raw[ames_raw$`Yr Sold`>=2008,]
The loss will be the same as before. If your algorithm predicts a price above the actual price, your company buys the house and overpays; if the predicted price is below the actual price, your company fails to buy.
calc_loss <- function(prediction, actual){
  difpred <- actual - prediction
  RMSE <- sqrt(mean(difpred^2))
  operation_loss <- abs(sum(difpred[difpred < 0])) + sum(0.1 * actual[difpred > 0])
  return(list(RMSE, operation_loss))
}
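As a quick sanity check, here is the loss on toy numbers (hypothetical values, not from the Ames data): overpredictions cost the full overpayment, underpredictions cost 10% of the actual price.

```r
# Toy vectors: two overpredictions (pay too much) and one underprediction (miss the buy)
prediction <- c(110, 95, 120)
actual     <- c(100, 100, 100)
loss <- calc_loss(prediction, actual)
loss[[1]]  # RMSE: sqrt(mean(c(-10, 5, -20)^2)) = sqrt(175), about 13.23
loss[[2]]  # operation loss: |-10 + -20| + 0.1 * 100 = 40
```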
There are several categories of feature engineering.
When handed a dataset, it’s easy to jump right into the analysis. This is typical behavior, especially for a novice. However, there is often information that could be explored if you know what you are looking for. There are a few categories of such information.
When you are not the data creator, you are sometimes not given access to certain information. The most common case is information that pertains to privacy or protected attributes. This information is often withheld for reasons external to the project you are working on. However, in certain circumstances, if you know what you are looking for, you may be able to negotiate for information that could save you headaches down the line. Think outside the box and be creative. The important caveat is that obtaining some information can have legal consequences. Web scraping and other means of data collection should be done with care. Some industries, such as pharmacies, have strict rules that prohibit the use of their data for retail purposes.
There is information about places and things on the internet that is easy to incorporate. For example, in housing data, geographic information can be tied to census information. Financial information might require adjusting for inflation, which again can be found on the internet. Other survey information may be available if you care to look for it. One thing to be careful about is that not all the information you can find will be useful. You need to balance the time required against the benefit of the information.
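For example, sale prices from different years are not directly comparable. A sketch of deflating them with an annual price index follows; the index values and the derived column name are illustrative, not official CPI figures.

```r
# Hypothetical price index by year (illustrative numbers)
cpi <- c(`2006` = 201.6, `2007` = 207.3, `2008` = 215.3,
         `2009` = 214.5, `2010` = 218.1)
# Express every sale in 2010 dollars
ames_raw$SalePrice_real <- ames_raw$SalePrice *
  cpi["2010"] / cpi[as.character(ames_raw$`Yr Sold`)]
```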
Coded variables without keys do not make sense, but to a computer they look like numeric variables. If you are not careful, you might include them as numeric. Take MS SubClass, which codes the building class.
table(ames_raw$`MS SubClass`)
##
## 020 030 040 045 050 060 070 075 080 085 090 120 150 160 180 190
## 1079 139 6 18 287 575 128 23 118 48 109 192 1 129 17 61
Unfortunately, the help file does not contain detailed information on the codes. With some research, you will find that the codes have no inherent ordering. Therefore, you need to think carefully about what matters and discretize the variable accordingly.
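A minimal sketch of two options: treat the codes as an unordered factor, or collapse them into coarser groups. The grouping below is purely illustrative, not an official recoding of MS SubClass.

```r
# Option 1: unordered factor, so the codes are never treated as numbers
ames_raw$MSSubClassFact <- factor(ames_raw$`MS SubClass`)

# Option 2: collapse into coarser, context-based groups (illustrative only)
newer_codes <- c("020", "060", "120", "160")
ames_raw$MSSubClassGrp <- ifelse(ames_raw$`MS SubClass` %in% newer_codes,
                                 "newer", "other")
table(ames_raw$MSSubClassGrp)
```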
To handle missing data, it's always essential to consider the context. Missing data is not by itself a problem. The fundamental problem is the bias these variables might introduce down the line if incorporated. Careful imputation takes effort. When time is a concern, deleting variables with a high rate of missingness should be considered.
Garage Yr Blt has 159 observations missing. But if you look carefully, you will realize that the houses missing this information are the ones that have no garage. This is not missing data but a coding problem. One must decide what to do with such information based on the context. You should not fill such missingness with an arbitrary number.
table(ames_raw$`Garage Cars`,is.na(ames_raw$`Garage Yr Blt`))
##
## FALSE TRUE
## 0 0 157
## 1 777 1
## 2 1603 0
## 3 374 0
## 4 16 0
## 5 1 0
missing_data_proportion<-colMeans(is.na(ames_raw))
plot(missing_data_proportion)
which(missing_data_proportion>0.1)
## Lot Frontage Alley Fireplace Qu Pool QC Fence Misc Feature
## 5 8 59 74 75 76
If missingness is informative, one might add an indicator variable to signify it. You will still need to fill the missing value with something, and the appropriate fill depends on the variable.
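A minimal sketch, using Lot Frontage as the example variable; the median is a plausible fill here, but the right choice depends on the variable.

```r
# Flag the missingness, then fill with the observed median
ames_raw$LotFrontage_missing <- as.integer(is.na(ames_raw$`Lot Frontage`))
ames_raw$LotFrontage_filled  <- ifelse(is.na(ames_raw$`Lot Frontage`),
  median(ames_raw$`Lot Frontage`, na.rm = TRUE),
  ames_raw$`Lot Frontage`)
```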
If the data are missing completely at random (MCAR), one can remove the rows with missingness without introducing bias. However, this is a strong assumption that is often not met in practice.
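Listwise deletion is a one-liner. Note that in Ames almost every row has some NA, so dropping incomplete rows across all columns would be drastic; in practice you would subset to the columns of interest first.

```r
# Keep only rows with no missing values (unbiased only under MCAR)
ames_complete <- ames_raw[complete.cases(ames_raw), ]
# na.omit(ames_raw) is equivalent
nrow(ames_complete)  # compare with nrow(ames_raw) to see the cost
```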
For data missing at random (MAR), regression-based imputation is often used. Many packages let you do these imputations reasonably easily. One thing to keep in mind is that some imputation methods work better after transformation than before; this depends on the model being used to impute. See the mi and mice packages for details.
Variables that are missing not at random (MNAR) are hard to deal with. One needs to weigh the cost and benefit of including such variables. An example is a variable like income. If all the low-income people are not responding, one might use a small number as a proxy. But if there are reasons to believe there are multiple causes of missingness, and no way to tell them apart, you might be better off removing the variable.
Problematic observations such as outliers are hard to find and often require you to revisit this step a few times. This is important because you must deal with them before applying transformations. For example, outliers distort statistics such as means, which is problematic if you plan to use a z-score transformation. When you have many zeros, this can affect how you want to transform a variable. EDA often finds outliers, but some may not surface until the modeling phase. Truncating or removing outliers should be done with caution, since doing so can introduce unwanted features into the data.
Here is an illustration of two types of outliers that are harder and easier to find.
library(mvtnorm)  # provides rmvnorm
dat<-rmvnorm(100,c(0,0),matrix(c(3,2,2,3),2,2))
dat<-rbind(dat,c(7,7), c(-3,4))
par(mfrow=c(1,3))
plot(dat[,1],dat[,2],col=c(rep(1,100),2,4))
plot(dat[,1],col=c(rep(1,100),2,4));
plot(dat[,2],col=c(rep(1,100),2,4))
Looking at the basement and 2nd-floor square footage, you can see that there is bimodality as well as properties that are outliers. This should make you cautious about scaling these variables.
plot(ames_raw$`Bsmt Unf SF`,ames_raw$`2nd Flr SF`)
Context matters when doing feature engineering. Take, for example, the Ames housing data. Ames is a university town where many people have ties to Iowa State University. Therefore, a feature like distance from the university might make sense to include in the analysis. Another thing to consider is the year built. The impact of the year built is not absolute and shifts over the years, so one might create a variable for the age of the house at the time of sale.
# handling Year features
ames_raw$yrs_since_remod <- ames_raw$`Yr Sold` - ames_raw$`Year Remod/Add`
# Total Living Area
ames_raw$TotalArea <- ames_raw$`Gr Liv Area` + ames_raw$`Total Bsmt SF`
# TotalBath
ames_raw$TotalBath <- ames_raw$`Bsmt Full Bath` + 0.5 * ames_raw$`Bsmt Half Bath` + ames_raw$`Full Bath` + 0.5 * ames_raw$`Half Bath`
When a predictor is right-skewed, it tends to distort a linear model by creating leverage points. Taking a log resolves such problems.
library(ggplot2)
library(ggExtra)  # provides ggMarginal
p <- ggplot(ames_raw) + geom_point() + aes(x = `Gr Liv Area`, y = SalePrice) +
  xlab("Above grade (ground) living area square feet") + ylab("Sale Price") +
  geom_smooth(method = "lm", se = FALSE)
p2 <- ggMarginal(p, margins = 'both', fill = "skyblue", size = 4, type = "histogram")
p4 <- p + scale_y_log10() + scale_x_log10()
p3 <- ggMarginal(p4, margins = 'both', fill = "skyblue", size = 4, type = "histogram")
gridExtra::grid.arrange(p2, p3, ncol = 2)
For linear regression models, centering and scaling do not change the model itself, but they change the interpretability of the coefficients. Putting all predictors on a similar scale has the advantage that coefficient magnitudes directly indicate the influence of each predictor. For some hierarchical models, scaling also helps with convergence. And scaling is critical for all the distance-based methods you will encounter later in the semester.
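A minimal sketch of z-scoring with scale(). If you split into training and test sets, compute the centering and scaling constants on the training data only and reuse them on the test data; `new_data` below is a hypothetical placeholder.

```r
# z-score one predictor by hand and with scale()
liv   <- ames_raw$`Gr Liv Area`
liv_z <- (liv - mean(liv)) / sd(liv)
all.equal(liv_z, as.numeric(scale(liv)))

# Reusing training constants on new data
m <- mean(liv); s <- sd(liv)
# new_liv_z <- (new_data$`Gr Liv Area` - m) / s
```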
Categorical variables need to be coded appropriately. Dummy coding (one-hot encoding) is one option when the information is nominal. Take, for example, the building type variable: by default, it's a character variable with five values.
table(ames_raw$`Bldg Type`)
##
## 1Fam 2fmCon Duplex Twnhs TwnhsE
## 2425 62 109 101 233
One can use contextual information to convert them into meaningful variables like single-family versus multi-family or shared housing. Or use factor, which by default leads to dummy coding.
ames_raw$BldgTypeFact<-factor(ames_raw$`Bldg Type`)
head(model.matrix(~BldgTypeFact,ames_raw))
## (Intercept) BldgTypeFact2fmCon BldgTypeFactDuplex BldgTypeFactTwnhs
## 1 1 0 0 0
## 2 1 0 0 0
## 3 1 0 0 0
## 4 1 0 0 0
## 5 1 0 0 0
## 6 1 0 0 0
## BldgTypeFactTwnhsE
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
By default, R converts a factor into dummies, with the first level as the baseline. It's essential to know how a dummy variable enters a model, as this is model specific.
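For example, relevel lets you pick which level serves as the baseline; the choice of TwnhsE below is arbitrary, for illustration.

```r
# Choose the baseline level explicitly
ames_raw$BldgTypeFact2 <- relevel(factor(ames_raw$`Bldg Type`), ref = "TwnhsE")
head(model.matrix(~BldgTypeFact2, ames_raw))
```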
Not every categorical variable needs a unique category for each level. One might group some categories to have fewer levels to model. For example, the overall condition is rated from 1 to 10, as shown below.
ggplot(ames_raw)+geom_histogram()+aes(x=`Overall Cond`)
It's important to know which grouping is better. For the Ames data the levels are in fact: 10 Very Excellent, 9 Excellent, 8 Very Good, 7 Good, 6 Above Average, 5 Average, 4 Below Average, 3 Fair, 2 Poor, 1 Very Poor.
One could convert them into integers since there is explicit ordering. However, the distribution of the variable is uneven, with many observations at five and very few below 5. In such a case, combining the categories into three may be better since the data does not seem to have the resolution to understand the ten levels.
ames_raw$OverallCond3 <- ifelse( ames_raw$`Overall Cond` >5, 3, ifelse( ames_raw$`Overall Cond`<5, 1, 2))
ggplot(ames_raw)+geom_histogram()+aes(x=OverallCond3)
There are various reasons to be selective about what to include. This could be due to the limited information a variable carries given the sample size, contextual reasons, or overlapping information.
For highly correlated variables, you might select a subset so that the correlation does not impact the model building.
# Correlation matrix
library(dplyr)
library(GGally)  # provides ggcorr
numeric_vars <- ames_raw %>% select(where(is.numeric))
ggcorr(numeric_vars, hjust = 1)
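One way to automate the selection is caret's findCorrelation, sketched below. This assumes the caret package is installed, and the 0.9 cutoff is arbitrary.

```r
library(caret)
# Identify columns whose pairwise correlation exceeds the cutoff
cor_mat  <- cor(numeric_vars, use = "pairwise.complete.obs")
drop_idx <- findCorrelation(cor_mat, cutoff = 0.9)
numeric_vars_reduced <- if (length(drop_idx)) numeric_vars[, -drop_idx] else numeric_vars
```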
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
Alternatively, you can compress correlated variables using dimension reduction. However, there is no free lunch: you need to do all the scaling and missing-data processing before you can apply PCA, and you need to decide how many components to keep. The pcaMethods package offers a way to fit PCA even in the presence of missing data.
BiocManager::install("pcaMethods")
## Warning: package(s) not installed when version(s) same as or greater than current; use
## `force = TRUE` to re-install: 'pcaMethods'
library(pcaMethods)
pca_numeric<-pcaMethods::pca(numeric_vars,nPcs=20,scale="uv")
summary(pca_numeric)
## nipals calculated PCA
## Importance of component(s):
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## R2 0.2393 0.07939 0.06822 0.0543 0.04926 0.04631 0.03258 0.02859
## Cumulative R2 0.2393 0.31870 0.38692 0.4412 0.49048 0.53679 0.56937 0.59796
## PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16
## R2 0.02797 0.02671 0.02583 0.02512 0.02463 0.02379 0.02296 0.02282
## Cumulative R2 0.62593 0.65264 0.67848 0.70359 0.72822 0.75201 0.77497 0.79780
## PC17 PC18 PC19 PC20
## R2 0.02162 0.02113 0.01923 0.01748
## Cumulative R2 0.81941 0.84054 0.85977 0.87725
plot(pca_numeric)
slplot(pca_numeric)
The tidymodels package provides a feature engineering framework called recipes. It used to be its own package, but it has since been merged into tidymodels. You can find the example for Ames Housing here: https://www.tmwr.org/recipes
Doing things like:
- taking the log of Gr_Liv_Area,
- lumping Neighborhood levels with less than 1% prevalence into "other", and
- dummy coding all the nominal predictors
can be done easily as
library(tidymodels)
simple_ames <-
recipe(SalePrice ~ Neighborhood + `Gr Liv Area` + `Year Built` + `Bldg Type`,
data = ames_raw) %>%
step_log(`Gr Liv Area`, base = 10) %>%
step_other(Neighborhood, threshold = 0.01) %>%
step_dummy(all_nominal_predictors())
However, this recipe is not yet executed. You will use it within a tidymodels workflow, which runs these transformations when fitting the model.
Since you've worked on it in MA679, please copy and paste your best model here.
lmfit_2008<- lm(log(SalePrice) ~ log(`Lot Area`)+`Year Built`+`Year Remod/Add` + `Total Bsmt SF` + `1st Flr SF` + log(`Gr Liv Area`) + `Bsmt Full Bath` + `TotRms AbvGrd` + `Garage Area` + `Wood Deck SF` + `Open Porch SF` , data = ames_raw_2008)
summary(lmfit_2008)
##
## Call:
## lm(formula = log(SalePrice) ~ log(`Lot Area`) + `Year Built` +
## `Year Remod/Add` + `Total Bsmt SF` + `1st Flr SF` + log(`Gr Liv Area`) +
## `Bsmt Full Bath` + `TotRms AbvGrd` + `Garage Area` + `Wood Deck SF` +
## `Open Porch SF`, data = ames_raw_2008)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.48256 -0.08492 0.00058 0.09723 0.68601
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.573e+00 5.456e-01 -10.214 < 2e-16 ***
## log(`Lot Area`) 8.383e-02 1.094e-02 7.661 3.57e-14 ***
## `Year Built` 2.898e-03 2.298e-04 12.614 < 2e-16 ***
## `Year Remod/Add` 3.078e-03 3.153e-04 9.762 < 2e-16 ***
## `Total Bsmt SF` 1.588e-04 1.967e-05 8.076 1.50e-15 ***
## `1st Flr SF` -4.433e-05 2.275e-05 -1.949 0.05150 .
## log(`Gr Liv Area`) 6.828e-01 2.934e-02 23.276 < 2e-16 ***
## `Bsmt Full Bath` 6.376e-02 1.007e-02 6.331 3.35e-10 ***
## `TotRms AbvGrd` -3.040e-02 5.274e-03 -5.764 1.02e-08 ***
## `Garage Area` 2.378e-04 2.996e-05 7.938 4.41e-15 ***
## `Wood Deck SF` 1.167e-04 3.937e-05 2.964 0.00309 **
## `Open Porch SF` -1.455e-04 7.253e-05 -2.005 0.04513 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1721 on 1306 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8178, Adjusted R-squared: 0.8163
## F-statistic: 532.9 on 11 and 1306 DF, p-value: < 2.2e-16
##
Your answer:
Please write your answer in full sentences.
Please perform feature engineering on the Ames housing data that you think will help with the prediction.
Your answer:
Please write your answer in full sentences.
Compare the result before and after the feature engineering step.
Your answer:
Please write your answer in full sentences.
Feature engineering is mostly about context. But does it matter if prediction is all you care about? Are there automatic ways to do all of this that work better? Let's find out.
Include all the variables you used, as well as anything you want to add to the model.
vars <- c("SalePrice","Lot Area","Gr Liv Area","Full Bath")
#vars <- c("SalePrice")#
train <- ames_raw_2008[, vars]
test <- ames_raw_2009[, vars]
colnames(train) <- make.names(colnames(train))
colnames(test) <- make.names(colnames(test))
# mlr3 TaskRegr
train$SalePrice <- log(train$SalePrice)
test$SalePrice <- log(test$SalePrice)
mlr3 has an AutoML extension, mlr3automl.
https://github.com/a-hanf/mlr3automl/blob/master/vignettes/mlr3automl.md
https://a-hanf.github.io/useR_presentation/useR_2021_mlr3automl.pdf
You will need to install
devtools::install_github('https://github.com/mlr-org/mlr3extralearners')
devtools::install_github('https://github.com/a-hanf/mlr3automl', dependencies = TRUE)
# load packages and data
library(mlr3)
library(mlr3learners)
library(mlr3automl)
# Create a task
task <- as_task_regr(train, target ="SalePrice",id = "ames_raw")
# Auto ML
ames_model = AutoML(task)
train_indices = sample(1:task$nrow, 2/3*task$nrow)
ames_model$train(row_ids = train_indices)
predict_indices = setdiff(1:task$nrow, train_indices)
predictions = ames_model$predict(row_ids = predict_indices)
plot(predictions$response,predictions$truth);abline(0,1)
resampling_result = ames_model$resample()
iml_explainer = ames_model$explain(iml_package = "iml")
iml_pdp = iml::FeatureEffect$new(iml_explainer, feature="Lot.Area", method="pdp")
plot(iml_pdp)
h2o's AutoML is well known in the field as pretty powerful. https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html
# load packages and data
library(h2o)
# init h2o
h2o.init()
h2o.no_progress()
# upload the data
train_hf <- as.h2o(train)
test_hf <- as.h2o(test)
# fit a model
automl <- h2o.automl(y = "SalePrice", training_frame = train_hf, max_runtime_secs = 300)
model <- automl@leader
predictions=h2o.predict(model,newdata = test_hf)
plot( as.vector(predictions$predict),as.vector(test_hf$SalePrice));abline(0,1)
cor( as.vector(predictions$predict),as.vector(test_hf$SalePrice))
# shutdown h2o
h2o.shutdown(prompt =F)
From CRAN: Fits from simple regression to highly customizable deep neural networks either with gradient descent or metaheuristic, using automatic hyper parameters tuning and custom cost function. A mix inspired by the common tricks on Deep Learning and Particle Swarm Optimization.
https://cran.r-project.org/web/packages/automl/index.html
library(automl)
amlmodel <- automl_train_manual(
  Xref = train,
  Yref = as.numeric(train$SalePrice),
  hpar = list(learningrate = 0.01,
              minibatchsize = 2^2,
              numiterations = 600))
prediction = automl_predict(model = amlmodel, X = test)
plot(prediction,test$SalePrice);abline(0,1)
XGBoost is a popular implementation of gradient boosting, which we will cover in MA679. Leaving aside the details, it's another popular ML method with a lot of tuning parameters. autoxgboost searches for a good choice of these parameters automatically.
# load library
devtools::install_github("ja-thomas/autoxgboost")
library(autoxgboost)
# create a classification task
trainTask = makeRegrTask(data = train, target = "SalePrice")
# create a control object for optimizer
ctrl = makeMBOControl()
ctrl = setMBOControlTermination(ctrl, iters = 5L)
# fit the model
res = autoxgboost(trainTask, control = ctrl, tune.threshold = FALSE)
# do prediction and print confusion matrix
prediction = predict(res, test)
plot(prediction$data[,1],prediction$data[,2]);abline(0,1)
forester is similar to autoxgboost in that it fits tree-based models automatically. It can fit xgboost as well as its cousins, such as catboost.
#install.packages("devtools")
#devtools::install_github("ModelOriented/forester")
library(forester)
best_model <- forester::train(data = train,
y = "SalePrice",
type = "auto")
In R, missing data is represented by NA. But other closely related values are treated the same way in some applications. You can read about them in detail here.
?NA
?NULL
?NaN
?Inf
You need to check how missing values are represented in the original data and perform appropriate conversions.
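A minimal sketch with made-up sentinel codes; "N/A", the empty string, and -999 are common conventions, but check what your data actually uses.

```r
# Convert sentinel codes to real NA before any analysis
x <- c("12", "N/A", "", "-999", "34")
x[x %in% c("N/A", "", "-999")] <- NA
as.numeric(x)  # 12 NA NA NA 34
```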
Since NA is not a number, you need to use is.na() to find out if a value is NA or not.
x<-c(1,2,NA,3,4,NA)
is.na(x)
## [1] FALSE FALSE TRUE FALSE FALSE TRUE
If you want the elements that are not NA, you can negate with !.
!is.na(x)
## [1] TRUE TRUE FALSE TRUE TRUE FALSE
The problem with NA is that it's a logical value without defined operations. Simple functions like mean and median will return NA if the vector contains any NA.
x<-c(rnorm(10),NA)
mean(x)
## [1] NA
median(x)
## [1] NA
You can remove the NAs and compute these values manually, but base R functions have a parameter na.rm that does it for you.
mean(x,na.rm = T)
## [1] -0.3222407
mean(x[!is.na(x)])
## [1] -0.3222407
median(x,na.rm = T)
## [1] -0.03489692
median(x[!is.na(x)])
## [1] -0.03489692
R does available-case analysis by default. Let's generate fake x and y, and create a version of x with missing values.
x <- rnorm(100)
y <- 1+1.4*x+rnorm(100)
xmiss<-x
xmiss[sample(1:100,10)]<-NA
Compare the results.
library(arm)  # provides display()
display(lm(y~x))
## lm(formula = y ~ x)
## coef.est coef.se
## (Intercept) 1.07 0.11
## x 1.51 0.12
## ---
## n = 100, k = 2
## residual sd = 1.09, R-Squared = 0.61
display(lm(y~xmiss))
## lm(formula = y ~ xmiss)
## coef.est coef.se
## (Intercept) 1.05 0.12
## xmiss 1.50 0.13
## ---
## n = 90, k = 2
## residual sd = 1.11, R-Squared = 0.61
How about using more than one predictor?
x1<-rnorm(100)
x2<-rnorm(100)
y12<-1+1.4*x1-0.4*x2+rnorm(100)
x1miss<-x1
x2miss<-x2
x1miss[1:10] <-NA
x2miss[11:20]<-NA
display(lm(y12~x1+x2))
## lm(formula = y12 ~ x1 + x2)
## coef.est coef.se
## (Intercept) 1.04 0.11
## x1 1.24 0.11
## x2 -0.38 0.10
## ---
## n = 100, k = 3
## residual sd = 1.07, R-Squared = 0.62
display(lm(y12~x1miss+x2miss))
## lm(formula = y12 ~ x1miss + x2miss)
## coef.est coef.se
## (Intercept) 1.01 0.12
## x1miss 1.17 0.12
## x2miss -0.42 0.10
## ---
## n = 80, k = 3
## residual sd = 1.04, R-Squared = 0.62
display(lm(y12~x1))
## lm(formula = y12 ~ x1)
## coef.est coef.se
## (Intercept) 1.05 0.11
## x1 1.31 0.12
## ---
## n = 100, k = 2
## residual sd = 1.14, R-Squared = 0.56
display(lm(y12~x1miss))
## lm(formula = y12 ~ x1miss)
## coef.est coef.se
## (Intercept) 1.07 0.12
## x1miss 1.27 0.13
## ---
## n = 90, k = 2
## residual sd = 1.17, R-Squared = 0.54
x<-c(rpois(1000,50),rep(0,100))
logx_na<-ifelse(log(x)==-Inf,NA,log(x))
gp1=ggplot(data.frame(x))+geom_histogram()+aes(x=x)
gp2=ggplot(data.frame(x))+geom_histogram()+aes(x=log(x))
gp3=ggplot(data.frame(x))+geom_histogram()+aes(x=logx_na)
gp4=ggplot(data.frame(x))+geom_histogram()+aes(x=log(x+1))
gridExtra::grid.arrange(gp1,gp2,gp3,gp4,ncol=4)
## Warning: Removed 100 rows containing non-finite values (`stat_bin()`).
## Removed 100 rows containing non-finite values (`stat_bin()`).
Three Missing Data Mechanisms
MCAR mechanism
x <- rnorm(100)
y <- 1+1.4*x+rnorm(100)
When you have missing observations, it helps to understand the extent of missingness.
library(naniar)
vis_miss(riskfactors)
If the missingness is too severe, you should first question the use of that variable.
The next step in the visual assessment will be to see if there are specific patterns in the way data are missing. You will see these patterns when a group of questions are all related to the same issue.
library(UpSetR)
gg_miss_upset(riskfactors)
You can further drill down on the pattern of missings in each column,
broken down by a categorical variable from the dataset using
gg_miss_fct.
gg_miss_fct(x = riskfactors, fct = marital)
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `marital = (function (x) ...`.
## Caused by warning:
## ! `fct_explicit_na()` was deprecated in forcats 1.0.0.
## ℹ Please use `fct_na_value_to_level()` instead.
## ℹ The deprecated feature was likely used in the naniar package.
## Please report the issue at <https://github.com/njtierney/naniar/issues>.
ggplot2 drops the NAs; you can add them back using geom_miss_point. This lets you see whether there is a pattern in the missing data that varies with covariates.
x<-rnorm(1000)
y<- c(rnorm(900),rep(NA,100))
ggplot(data.frame(x, y)) + geom_point() + aes(x = x, y = y)
## Warning: Removed 100 rows containing missing values (`geom_point()`).
ggplot(data.frame(x, y)) + geom_point() + aes(x = x, y = y) + geom_miss_point()
## Warning: Removed 100 rows containing missing values (`geom_point()`).
Note that where the NAs are plotted is a little concerning, since the position is arbitrary and can lead to confusion. But at the exploration phase, it's better than ignoring them.
x <- rnorm(100)
xmiss<-x
xmiss[sample(1:100,10)]<-NA
Mean imputation
x_imp_mean <- xmiss
x_imp_mean[is.na(x_imp_mean)]<-mean(x_imp_mean,na.rm=TRUE)
x_imp_median <- xmiss
x_imp_median[is.na(x_imp_median)]<-median(x_imp_median,na.rm=TRUE)
Last value carried forward
na.locf <- function(x) {
  v <- !is.na(x)
  c(NA, x[v])[cumsum(v) + 1]
}
x_imp_lvcf<-na.locf(xmiss)
Indicator for missingness + mean imputation
x_imp_mean <- xmiss
x_miss_index <- 1*is.na(xmiss)
x_imp_mean[is.na(x_imp_mean)] <- mean(x_imp_mean, na.rm=TRUE)
New category for the missing value
x_cat<- sample(c("A","B","C"),100,TRUE)
x_cat[sample(1:100,10)]<-NA
x_cat_imp<-x_cat
x_cat_imp[is.na(x_cat_imp)]<-"D"
Take a random sample from the observed values and impute.
random.imp <- function (a){
  missing <- is.na(a)
  n.missing <- sum(missing)
  a.obs <- a[!missing]
  imputed <- a
  imputed[missing] <- sample(a.obs, n.missing, replace = TRUE)
  return (imputed)
}
x_imp_rand_simple<-random.imp(xmiss)
Fit a regression model on the observed data and impute the predicted value.
- Deterministic: use the predicted value.
- Random: add random noise to the prediction.
x <- rnorm(100)
y <- 1+1.4*x+rnorm(100)
ymiss<-y
ymiss[sample(1:100,10)]<-NA
lm.fit.model<-lm(ymiss~x)
y_imp_det <-ymiss
y_imp_det[is.na(y_imp_det)]<- predict (lm.fit.model,newdata=data.frame(x=x[is.na(ymiss)]))
y_imp_rand <- ymiss
y_imp_rand[is.na(y_imp_rand)] <- rnorm(sum(is.na(ymiss)),
  predict(lm.fit.model, newdata = data.frame(x = x[is.na(ymiss)])),
  sigma.hat(lm.fit.model))  # sigma.hat() comes from the arm package
par(mfrow=c(1,3))
plot(x,y,col=1+1*is.na(ymiss),pch=19,main="Original")
plot(x,y_imp_det,col=1+1*is.na(ymiss),pch=19,main="Deterministic")
plot(x,y_imp_rand,col=1+1*is.na(ymiss),pch=19,main="Random")
Regression based imputation can be repeated many times to update the imputation.
x1<-rnorm(100)
x2<-rnorm(100)
y12<-1+1.4*x1-0.4*x2+rnorm(100)
y12miss <- y12
x1miss <- x1
x2miss <- x2
x1miss[1:10] <-NA
x2miss[11:20]<-NA
y12miss[5:15]<-NA
rand.reg.imp <- function (dat.y, dat.x){
  missing <- is.na(dat.y)
  n.missing <- sum(missing)
  imputed <- dat.y
  lm.fit.model <- lm(dat.y ~ ., data.frame(dat.y, dat.x))
  imputed[missing] <- rnorm(n.missing,
    predict(lm.fit.model, newdata = data.frame(dat.x[missing, ])),
    sigma.hat(lm.fit.model))  # sigma.hat() comes from the arm package
  return (imputed)
}
misdat<-data.frame(y=y12miss,x1=x1miss,x2=x2miss)
fildat<-apply(misdat,2,random.imp)
for(j in 1:100){
  for(i in 1:3){
    fildat[,i] <- rand.reg.imp(misdat[,i], fildat[,-i])
  }
}
Multiple imputation is a way of imputing multiple times to account for the uncertainty in the imputation.
mi has been my personal favorite since I was the original developer. It does multiple imputation using chained equations, and it was the first package to think carefully about diagnostics and convergence of the multiple imputations.
library(mi)
mdf <- missing_data.frame(misdat)
imp_mi <- mi(mdf, seed = 335)
summary(imp_mi)
## $y
## $y$is_missing
## missing
## FALSE TRUE
## 89 11
##
## $y$imputed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.175343 -0.370213 -0.002965 0.006035 0.346768 1.127787
##
## $y$observed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.15354 -0.29616 0.04957 0.00000 0.32849 1.23374
##
##
## $x1
## $x1$is_missing
## missing
## FALSE TRUE
## 90 10
##
## $x1$imputed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.7697 -0.1334 0.1395 0.1742 0.4191 1.2886
##
## $x1$observed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.23361 -0.40282 0.07882 0.00000 0.36343 1.01490
##
##
## $x2
## $x2$is_missing
## missing
## FALSE TRUE
## 90 10
##
## $x2$imputed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.975021 -0.526084 -0.278845 -0.263816 -0.006778 0.446671
##
## $x2$observed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.94084 -0.32633 -0.03362 0.00000 0.41717 1.34881
mi_compdat<-mi:::complete(imp_mi,2)
plot(imp_mi)
mice is considered the most widely used software in this space. It predates mi, but it adopted many functions from mi and the two are very similar.
library(mice)
imp_mice <- mice(misdat, m=5, maxit = 50, method = 'pmm', seed = 500)
##
## iter imp variable
## 1 1 y x1 x2
## 1 2 y x1 x2
## 1 3 y x1 x2
## 1 4 y x1 x2
## 1 5 y x1 x2
## ...
## 50 5 y x1 x2
imp_mice$imp # Imputed values
## $y
## 1 2 3 4 5
## 5 0.1915959 -0.40519236 1.39881020 1.8676865 0.30939834
## 6 2.7447702 -0.43398485 1.98581978 1.7757237 1.64084122
## 7 0.3436375 1.64084122 0.56529989 -1.8747174 1.55497973
## 8 2.8354757 4.85985712 1.86768654 5.0739152 1.36313727
## 9 0.7918676 1.39981641 -0.95888615 0.1915959 1.67065916
## 10 -1.1578936 0.15156745 -0.02748227 -1.8747174 2.14273524
## 11 -1.0167486 1.64084122 1.55497973 1.5549797 -1.87471735
## 12 1.3696749 1.42082456 -0.09988091 1.9847025 1.98470245
## 13 2.3478799 -0.09988091 -0.43398485 1.0567470 -0.02748227
## 14 -3.1780567 0.31393114 -0.49478648 -0.3951807 -1.15789359
## 15 2.4006078 2.40060777 2.40060777 2.4825946 1.82065185
##
## $x1
## 1 2 3 4 5
## 1 0.29747424 0.391915163 0.6524310 0.25986764 0.1673495
## 2 0.36433700 0.364336999 1.2176289 0.81349533 0.2737471
## 3 1.52410399 0.836374444 0.4616355 0.83637444 1.5241040
## 4 1.22916446 -0.428969938 0.4616355 0.33699024 0.4616355
## 5 -0.66064519 -2.205324320 0.2737471 0.46163550 -1.0980212
## 6 1.31793154 -0.424439009 1.2176289 -0.47865037 1.1968380
## 7 0.08178517 0.461635504 -0.4649633 -1.05335840 0.6176239
## 8 1.03709719 2.002095744 1.2285925 2.32524759 0.9737953
## 9 0.65243100 0.728737433 -2.0820366 0.08178517 0.8314716
## 10 -1.78130180 0.002099191 -1.1435303 -1.05335840 0.4227256
##
## $x2
## 1 2 3 4 5
## 11 -1.4373575 -0.89689823 -1.74376829 -0.4912884 0.2941387
## 12 -0.5089588 1.30038045 0.02755044 -0.7667865 -0.5089588
## 13 -0.3159401 -0.34441789 -0.01226730 0.2522217 -0.7690259
## 14 0.7377375 -0.49128840 -0.47478500 1.3003804 -0.5551897
## 15 -1.8966227 -0.47478500 -0.05078550 -0.7143761 3.1301698
## 16 -0.5152477 -0.04783456 -1.91444934 0.3942609 0.4804277
## 17 1.1940832 -0.49128840 -0.54307869 -1.8966227 -0.5430787
## 18 -1.9144493 0.52847962 0.50635150 1.3483236 1.3483236
## 19 -0.2463615 -0.83081880 -1.89449232 2.3939096 0.1604178
## 20 0.2844143 -0.93861180 -0.71437609 -1.9802127 0.1115891
comp_mice <- complete(imp_mice,2) # Filled in matrix
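The point of keeping all five imputations is to propagate the imputation uncertainty into the analysis. With mice, a sketch of this is to fit the model on each completed dataset and pool the results with Rubin's rules:

```r
# Fit the analysis model on each completed dataset, then pool
fit_mice <- with(imp_mice, lm(y ~ x1 + x2))
pooled   <- pool(fit_mice)
summary(pooled)  # estimates combine within- and between-imputation variance
```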
Machine-learning-based imputation is pretty popular nowadays. missForest is one example; it uses a random forest algorithm for imputation.
library(missForest)
imp_forest<- missForest(misdat)
We will not go too deeply into time-series imputation, but it also comes up fairly frequently. The imputeTS package handles it; more information can be found here: https://cran.rstudio.com/web/packages/imputeTS/vignettes/Cheat_Sheet_imputeTS.pdf
library("imputeTS")
ggplot_na_distribution(tsAirgap)
imp <- na_kalman(tsAirgap)
ggplot_na_imputations(tsAirgap, imp)